Practical compressed string dictionaries

Authors

  • Miguel A. Martínez-Prieto
  • Nieves R. Brisaboa
  • Rodrigo Cánovas
  • Francisco Claude
  • Gonzalo Navarro
Abstract

The need to store and query a set of strings – a string dictionary – arises in many kinds of applications. While classically these string dictionaries have accounted for a small share of the total space budget (e.g., in Natural Language Processing or when indexing text collections), recent applications in Web engines, Semantic Web (RDF) graphs, Bioinformatics, and many others, handle very large string dictionaries, whose size is a significant fraction of the whole data. In these cases, string dictionary management is a scalability issue by itself. This paper focuses on the problem of managing large static string dictionaries in compressed main memory space. We revisit classical solutions for string dictionaries like hashing, tries, and front-coding, and improve them by using compression techniques. We also introduce some novel string dictionary representations built on top of recent advances in succinct data structures and full-text indexes. All these structures are empirically compared on a heterogeneous testbed formed by real-world string dictionaries. We show that the compressed representations may use as little as 5% of the original dictionary size, while supporting lookup operations within a few microseconds. These numbers outperform the state-of-the-art space/time tradeoffs in many cases. Furthermore, we enhance some representations to provide prefix- and substring-based searches, which also perform competitively. The results show that compressed string dictionaries are a useful building block for various data-intensive applications in different domains.
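As a rough illustration of the front-coding idea named in the abstract, the following Python sketch stores a sorted set of strings in fixed-size buckets: the first string of each bucket is kept in full, and every other string is kept as the length of the prefix it shares with its predecessor plus the remaining suffix. This is a plain textbook variant under stated assumptions, not the paper's implementation; the class name `FrontCodedDict`, the method names `locate` and `extract`, and the `bucket_size` parameter are chosen here for illustration only, and none of the additional compression the paper applies is shown.

```python
from bisect import bisect_right


def common_prefix_len(a: str, b: str) -> int:
    """Length of the longest common prefix of a and b."""
    n = min(len(a), len(b))
    i = 0
    while i < n and a[i] == b[i]:
        i += 1
    return i


class FrontCodedDict:
    """Static string dictionary: sorted strings, front-coded in buckets.

    Illustrative sketch only; names and layout are assumptions.
    """

    def __init__(self, strings, bucket_size=4):
        words = sorted(set(strings))
        self.bucket_size = bucket_size
        self.size = len(words)
        self.heads = []    # first string of each bucket, stored in full
        self.buckets = []  # other strings as (shared_prefix_len, suffix)
        for start in range(0, len(words), bucket_size):
            chunk = words[start:start + bucket_size]
            self.heads.append(chunk[0])
            coded = []
            for prev, cur in zip(chunk, chunk[1:]):
                k = common_prefix_len(prev, cur)
                coded.append((k, cur[k:]))
            self.buckets.append(coded)

    def _decode_bucket(self, b):
        """Yield the strings of bucket b in sorted order."""
        cur = self.heads[b]
        yield cur
        for k, suffix in self.buckets[b]:
            cur = cur[:k] + suffix
            yield cur

    def locate(self, s):
        """Return the 0-based rank of s in sorted order, or -1 if absent."""
        # Binary search over bucket heads, then a sequential scan.
        b = bisect_right(self.heads, s) - 1
        if b < 0:
            return -1
        for offset, word in enumerate(self._decode_bucket(b)):
            if word == s:
                return b * self.bucket_size + offset
        return -1

    def extract(self, i):
        """Return the string whose rank is i."""
        if not 0 <= i < self.size:
            raise IndexError(i)
        b, offset = divmod(i, self.bucket_size)
        for j, word in enumerate(self._decode_bucket(b)):
            if j == offset:
                return word
```

In this sketch a lookup binary-searches the uncompressed bucket heads and then decodes at most one bucket, so a larger `bucket_size` trades slower lookups for fewer strings stored in full.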

Similar resources

Space-efficient Data Structures for Collections of Textual Data

This thesis focuses on the design of succinct and compressed data structures for collections of string-based data, specifically sequences of semi-structured documents in textual format, sets of strings, and sequences of strings. The study of such collections is motivated by a large number of applications both in theory and practice. For textual semi-structured data, we introduce the concept of ...

Frames for compressed sensing using coherence

We give some new results on sparse signal recovery in the presence of noise, for weighted spaces. Traditionally, dictionaries with norm equal to 1 were used, but for random dictionaries this condition is rarely satisfied. Moreover, we give better estimates than the ones given recently by Cai, Wang and Xu.

Data Compression Using a Dictionary of Patterns

Most modern lossless data compression techniques are based on dictionaries. If a string in the data being compressed matches a previously seen portion, that string is added to the dictionary and a reference to it is emitted every time it occurs. A possible generalization of this scheme is to consider not only strings made of consecutive symbols, but more general patterns with ga...

Efficient Data Representations for Information Retrieval

The key role compression plays in efficient information retrieval systems has been recognized for some time. However, applying a traditional compression algorithm to the contents of an information retrieval system is often not the best solution. For example, it is inefficient to perform search operations in maximally compressed data or to find the intersection of maximally compressed sets. In o...

Entropy-Compressed Indexes for Multidimensional Pattern Matching

In this talk, we will discuss the challenges involved in developing multidimensional generalizations of compressed text indexing structures. These structures depend on some notion of Burrows-Wheeler transform (BWT) for multiple dimensions, though naive generalizations do not enable multidimensional pattern matching. We study the 2D case to possibly highlight combinatorial properties that do n...

Journal title:
  • Inf. Syst.

Volume 56

Publication date: 2016